Multilevel Measures of Document Similarity

Authors

  • Irina Matveeva
  • Gina-Anne Levow
Abstract

Many applications, such as document summarization, passage retrieval, and question answering, require a detailed analysis of semantic relations between terms within and across documents and sentences. Often one has a number of sentences or paragraphs and must choose the candidate most relevant to the topic or question. An additional requirement may be that the information content of the next candidate differs from that of the sentences already chosen.
Many approaches to information retrieval and document classification model the semantic similarity between documents using the relations between semantic classes of words. They include representing dimensions of the document vectors with distributional term clusters (?) and expanding the document and query vectors with synonyms and related terms, as discussed in (?). Latent Semantic Analysis (LSA) (?) is one of the best-known dimensionality reduction algorithms. It represents documents as vectors in the space of latent semantic concepts. Latent Dirichlet Allocation (LDA) (?) uses the latent semantic concepts as bottleneck variables in computing the term distributions for documents. The new representation captures overall semantic similarity between documents but is less sensitive to differences at the sentence level. Moreover, these methods include all vocabulary terms in their computations, which limits their applicability.
Semantic similarity at the word level is targeted in word sense disambiguation (WSD), e.g. Schütze (?), and in verb classification, e.g. D. Lin (?). This research has shown that different measures of similarity may be required for different groups of terms, such as nouns and verbs. It is also reasonable to use different notions of similarity for content-bearing general vocabulary words and named entities. WSD methods usually use co-occurrence statistics, whereas verb similarity measures are based on syntactic similarity.
In this project, we propose to use a combination of similarity measures between terms to model document similarity. We divide the vocabulary into general vocabulary terms and named entities and compute a separate similarity score for each group of terms. The overall similarity score is a function of these two scores. In addition, we use statistical co-occurrence as well as syntactic similarity to compute the similarity between the general vocabulary terms.
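
The two-score scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the combining function or the per-group measures, so the sketch assumes a linear interpolation with a hypothetical `weight` parameter and substitutes plain cosine similarity over term counts for both groups (in place of the co-occurrence and syntactic measures the abstract mentions). The function names and the pre-supplied `named_entities` set are also illustrative assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def split_terms(tokens, named_entities):
    """Split tokens into general-vocabulary counts and named-entity counts."""
    general = Counter(t for t in tokens if t not in named_entities)
    entities = Counter(t for t in tokens if t in named_entities)
    return general, entities


def combined_similarity(tokens_a, tokens_b, named_entities, weight=0.5):
    """Assumed combination: weight * general score + (1 - weight) * entity score."""
    gen_a, ent_a = split_terms(tokens_a, named_entities)
    gen_b, ent_b = split_terms(tokens_b, named_entities)
    general_score = cosine(gen_a, gen_b)
    entity_score = cosine(ent_a, ent_b)
    return weight * general_score + (1.0 - weight) * entity_score


# Two toy documents that share a named entity but no general vocabulary.
doc_a = ["chicago", "river"]
doc_b = ["chicago", "bank"]
score = combined_similarity(doc_a, doc_b, named_entities={"chicago"})
```

With the default `weight` of 0.5, the toy documents score 0.5: the entity score is 1.0 and the general-vocabulary score is 0.0. Keeping the two scores separate until the final step is what would allow a different measure to be plugged in for each group of terms.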

Similar resources

Document Representation and Multilevel Measures of Document Similarity

We present our work on combining large-scale statistical approaches with local linguistic analysis and graph-based machine learning techniques to compute a combined measure of semantic similarity between terms and documents for application in information extraction, question answering, and summarisation.

Investigating the Applicability of Centrality Measures as Indicators of Citation Relatedness of Documents in Relational Information Retrieval: A Preliminary Study

Purpose: This pilot study investigates the correlation between centrality measures and bibliographic coupling, a well-known citation-based document similarity measure. Methodology: Using the citation analysis method, 40 research articles belonging to four engineering/pure disciplines (Physics, Chemistry, Biology, and Computer) and four Humanities and Social disciplines (Economics, Edu...

HESITANT FUZZY INFORMATION MEASURES DERIVED FROM T-NORMS AND S-NORMS

In this contribution, we first introduce the concept of a metrical T-norm-based similarity measure for hesitant fuzzy sets (HFSs) by using the concept of a T-norm-based distance measure. Then, the relationship of the proposed metrical T-norm-based similarity measures with another kind of information measure, called the metrical T-norm-based entropy measure, is discussed. The main feature ...

SOME SIMILARITY MEASURES FOR PICTURE FUZZY SETS AND THEIR APPLICATIONS

In this work, we present some novel processes to measure the similarity between picture fuzzy sets. Firstly, we adopt the concept of intuitionistic fuzzy sets, interval-valued intuitionistic fuzzy sets and picture fuzzy sets. Secondly, we develop some similarity measures between picture fuzzy sets, such as the cosine similarity measure, weighted cosine similarity measure, set-theoretic similar...

Combining Multilevel and Multifeature Representation to Compute Melodic Similarity

In the proposed approach, melodic similarity is computed as a content-based information retrieval task. To this end, the initial incipit is considered as the query in a query-by-example paradigm and the ranked list of potentially similar documents is given by the list of retrieved documents. The approach to retrieval is based on document indexing, where each document is described by alternative ...

Publication date: 2006